Abstract

It is common practice to partition complex workflows into separate
channels in order to speed up their completion times. When this is done
within a distributed environment, unavoidable fluctuations make individual
realizations depart from the expected average gains. We present a
method for breaking any complex workflow into several workloads in such
a way that, once their outputs are joined, their full completion takes less
time and exhibits smaller variance than when running in only one channel.
We demonstrate the effectiveness of this method in two different scenarios:
the optimization of a convex function and the transmission of a large
computer file over the Internet.

It is well known that the partition of large workflows into smaller workloads
can often accelerate the completion of the full process. This is the assumption
underlying the parallelization of large computer jobs, such as the use of
map-reduce for indexing documents [1], parallel algorithms for machine learning,
decentralization of load balancing in networks, and many other
complex processes.

Beyond computer algorithms, other examples of large workflows that can
be partitioned into smaller workloads are the transmission of big files over the
Internet, the processing of very large printing jobs using more than one
printer, the introduction of additional roads in urban traffic, and the
breakup of manufacturing processes into parallel streams. In all these cases,
once all the workloads are processed, the results are pieced together to produce 
a useful output. 

The parallelization procedure entails a decision on how to partition the
workflow so that its full completion process takes the shortest time with minimum
uncertainty. Uncertainty is a relevant and important variable because of the
unavoidable fluctuations in processing a workload that each processing unit,
channel or virtual machine undergoes when having to time share with other
processes. This introduces a stochastic component into the execution of any
program, which at times can actually increase the time it takes for a given
workflow to finish. These fluctuations must therefore be incorporated into the
partitioning procedure, so that the overall workflow completes in a shorter time
than the original non-partitioned one.

In what follows, we describe a novel procedure for breaking any complex
workflow into several workloads in such a way that, once their outputs are joined,
their full completion takes less time and carries less uncertainty than when
running it on only one processor. The procedure is based on notions of risk from
economics that are used to combine primitive algorithms into new programs that
are preferable to any of the primitive ones. In our case, however, we focus both
on speeding up the completion time and on lowering the uncertainty of the joint
execution of the complementary workloads. This implies that the overall
processing time of the full program is determined by the longest running process.
After presenting the method, we demonstrate its effectiveness in two different
scenarios: the optimization of a convex function and the transmission of a large
computer file over the Internet.

Footnote: This is unlike map-reduce, which splits the execution inputs into equal
parts. While map-reduce can minimize execution times, it does not necessarily
minimize uncertainty. Thus, while execution times are reduced on average, single
instances can still take a very long time to process.

In order to handle this problem, consider a workflow $D$, partitioned into two
workloads $D_i$ and $D_j$, which are processed on different machines $i$ and $j$,
with different processing speeds and fluctuating performances. Once the slowest
workload has completed, the two outputs are joined together and the workflow
is considered complete.

For simplicity in the exposition, we will assume that the completion time $t_i$
for the full workflow $D$ executing on machine $i$ is a continuous variable which
is Normally distributed with mean $\mu_i$ and standard deviation $\sigma_i$.




If the workload on machine $i$, $D_i$, is smaller than $D$ by a factor of $f$, i.e.
$|D_i| = f|D|$, the resulting distribution of completion times $t_i$ for machine
$i$ is given by

$$p(t_i \mid D_i, \mu_i, \sigma_i) \sim \mathcal{N}(f\mu_i, [f\sigma_i]^2)$$

and similarly for machine $j$, which processes workload $D_j$ with
$|D_j| = (1 - f)|D|$,

$$p(t_j \mid D_j, \mu_j, \sigma_j) \sim \mathcal{N}([1 - f]\mu_j, [(1 - f)\sigma_j]^2)$$


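The workflow only completes when both machines $i$ and $j$ finish processing
their assigned workloads $D_i$ and $D_j$, respectively. Thus, the cumulative
distribution function for the completion time $t$ is the probability that both
workloads complete within a time $\epsilon$, i.e. that both $t_i$ and $t_j$ are
below $\epsilon$. Treating the two completion times as independent, this is

$$P(t < \epsilon \mid f, D, \mu_i, \sigma_i, \mu_j, \sigma_j) = P(t_i < \epsilon \mid D_i, \mu_i, \sigma_i) \cdot P(t_j < \epsilon \mid D_j, \mu_j, \sigma_j) \tag{1}$$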
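
Equation (1) can be evaluated directly, since each factor is a Normal cumulative distribution function. The following is a minimal sketch under the assumptions stated in the text (independent, Normally distributed completion times); the function names are illustrative and not part of the original work.

```python
import math


def normal_cdf(x, mean, std):
    """CDF of a Normal(mean, std^2) variable, written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def completion_cdf(eps, f, mu_i, sigma_i, mu_j, sigma_j):
    """P(t < eps) for the joined workflow, following Equation (1).

    Assumes 0 < f < 1, so that both scaled standard deviations are positive,
    and assumes the two machines fluctuate independently.
    """
    p_i = normal_cdf(eps, f * mu_i, f * sigma_i)                   # machine i, workload f|D|
    p_j = normal_cdf(eps, (1.0 - f) * mu_j, (1.0 - f) * sigma_j)   # machine j, workload (1-f)|D|
    return p_i * p_j


# With mu_i = 30 and mu_j = 20, the split f = 0.4 gives both workloads an
# expected time of 12, so each factor is one half at eps = 12.
print(completion_cdf(12.0, 0.4, 30.0, 2.0, 20.0, 6.0))
```

At the median of both workloads the joint probability is only about one quarter, which illustrates why the slower of the two channels dominates the completion time.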


The decision as to how to partition the workflow consists in choosing the value
of $f$ such that the workflow will execute with the lowest expected completion
time $\mu(f)$ and variance $\sigma^2(f)$. This requires understanding the behavior
of $\mu$ and $\sigma^2$ as functions of $f$. They can be derived from the
probability density function of $t$ as


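$$\mu(f) = E(t \mid \Theta) = \int_0^\infty t \, p(t \mid \Theta) \, dt$$

$$\sigma^2(f) = \mathrm{Var}(t \mid \Theta) = \int_0^\infty t^2 \, p(t \mid \Theta) \, dt - [E(t \mid \Theta)]^2$$

with

$$\Theta = \{f, D, \mu_i, \sigma_i, \mu_j, \sigma_j\}$$

and with the probability density function given by the first-order derivative of
the cumulative distribution function shown in Equation (1),

$$p(t \mid \Theta) = \frac{d}{dt} P(t' < t \mid \Theta)$$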


Since there is no closed form solution for the probability density function, we
express the expected completion time $\mu(f)$ in terms of the cumulative
distribution function shown in Equation (1),

$$\mu(f) = \int_0^\infty [1 - P(t < \epsilon \mid \Theta)] \, d\epsilon$$

Similarly, the variance $\sigma^2(f)$ is given by

$$\sigma^2(f) = 2 \int_0^\infty \epsilon \, [1 - P(t < \epsilon \mid \Theta)] \, d\epsilon - [\mu(f)]^2$$
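
The two integrals over the cumulative distribution function lend themselves to simple numerical quadrature. The sketch below uses the trapezoidal rule with a truncated upper limit; the step count, the ten-standard-deviation cutoff and the function names are assumptions made for illustration, not choices reported in the text.

```python
import math


def normal_cdf(x, mean, std):
    """CDF of a Normal(mean, std^2) variable."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def completion_moments(f, mu_i, sigma_i, mu_j, sigma_j, steps=20000):
    """Return (mean, variance) of the joined completion time for 0 < f < 1."""
    mean_i, std_i = f * mu_i, f * sigma_i
    mean_j, std_j = (1.0 - f) * mu_j, (1.0 - f) * sigma_j
    # Truncate the integrals where the survival probability is negligible.
    upper = max(mean_i + 10.0 * std_i, mean_j + 10.0 * std_j)
    h = upper / steps
    first = 0.0    # accumulates the integral of [1 - P]
    second = 0.0   # accumulates the integral of eps * [1 - P]
    for k in range(steps + 1):
        eps = k * h
        survival = 1.0 - normal_cdf(eps, mean_i, std_i) * normal_cdf(eps, mean_j, std_j)
        weight = 0.5 if k in (0, steps) else 1.0   # trapezoidal end-point weights
        first += weight * survival
        second += weight * eps * survival
    mean = h * first
    variance = 2.0 * h * second - mean ** 2
    return mean, variance


mean, variance = completion_moments(0.4, 30.0, 2.0, 20.0, 6.0)
print(mean, variance)
```

For the parameter values used in Figure 1 and $f = 0.4$, this yields an expected completion time of roughly 13.5, below the means of 30 and 20 obtained when either machine processes the whole workflow alone.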





Figure 1: (a) $\mu$ with respect to $f$; (b) $\sigma^2$ with respect to $f$.
Figures 1a and 1b show $\mu$ and $\sigma^2$ as functions of $f$. The values used
are $\mu_i = 30$, $\sigma_i = 2$, $\mu_j = 20$, $\sigma_j = 6$. The bolded red •
is the efficient frontier that provides the best combinations of $\mu$ and
$\sigma^2$. The value of $f$ which gives the minimum point is different in (1a)
and (1b), and that results in a range of values that $f$ can possibly take.


Figures 1a and 1b show the behavior of the expected time to completion $\mu$ of
a workflow and of its variance $\sigma^2$, both as functions of $f$, while
Figure 2 shows $\mu$ and $\sigma^2$ parametrically as a function of each other.
As can be seen, the partition of the workflow results in completion times and
variances that can be much smaller than the original unpartitioned ones.
Moreover, the minima for $\mu$ and $\sigma^2$ occur at different values of $f$,
a fact that determines a range of choices.

Figure 2: Parametric plot of $\mu(f)$ (vertical axis) and $\sigma^2(f)$
(horizontal axis) for $\mu_i = 30$, $\sigma_i = 2$, $\mu_j = 20$, $\sigma_j = 6$.
The bolded red • corresponds to the efficient frontier in Figures 1a and 1b.

Since the resulting curve in Figure 2 is parabolic, some values of $\mu$ have two
possible choices of $\sigma^2$ and vice versa. If our assumptions on the
statistical distribution of completion times for the two parallel workloads hold,
the theoretical results shown in Figure 2 allow us to decide the appropriate
value of $f$ which minimizes $\mu$ and $\sigma^2$ for the full workflow execution.
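
The range of sensible choices for $f$ can be located by sweeping $f$ over a grid and recording where the mean and the variance attain their minima. The following sketch does this for the Normal model above; the grid resolution, the quadrature settings and the function names are illustrative assumptions.

```python
import math


def normal_cdf(x, mean, std):
    """CDF of a Normal(mean, std^2) variable."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def moments(f, mu_i, sigma_i, mu_j, sigma_j, steps=4000):
    """Mean and variance of the joined completion time by trapezoidal integration."""
    m_i, s_i = f * mu_i, f * sigma_i
    m_j, s_j = (1.0 - f) * mu_j, (1.0 - f) * sigma_j
    upper = max(m_i + 10.0 * s_i, m_j + 10.0 * s_j)
    h = upper / steps
    first = second = 0.0
    for k in range(steps + 1):
        eps = k * h
        surv = 1.0 - normal_cdf(eps, m_i, s_i) * normal_cdf(eps, m_j, s_j)
        w = 0.5 if k in (0, steps) else 1.0
        first += w * surv
        second += w * eps * surv
    mean = h * first
    return mean, 2.0 * h * second - mean * mean


def efficient_range(mu_i, sigma_i, mu_j, sigma_j):
    """Return (f minimizing the mean, f minimizing the variance, grid points between them)."""
    grid = [k / 100.0 for k in range(1, 100)]   # f = 0.01, ..., 0.99
    stats = {f: moments(f, mu_i, sigma_i, mu_j, sigma_j) for f in grid}
    f_mean = min(grid, key=lambda f: stats[f][0])
    f_var = min(grid, key=lambda f: stats[f][1])
    lo, hi = sorted((f_mean, f_var))
    # Between the two minima, lowering the mean raises the variance and vice versa.
    frontier = [f for f in grid if lo <= f <= hi]
    return f_mean, f_var, frontier


f_mean, f_var, frontier = efficient_range(30.0, 2.0, 20.0, 6.0)
print(f_mean, f_var, len(frontier))
```

Any $f$ outside the returned interval is dominated, in the sense that some value inside it has both a lower mean and a lower variance; the choice within the interval depends on how the two are traded off.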

This methodology is general enough to be applicable to a number
of scenarios. In what follows we illustrate this approach with two concrete
examples that can be easily tested in the laboratory. The first one is the
mathematical optimization of a convex function, while the second corresponds to
the transmission of a large file over the Internet.

We first demonstrate the parallel optimization of a least squares error
function used for logistic regression classification. This function is quadratic
and therefore convex. Our approach differs from a parallel algorithm such as
map-reduce, which breaks the input into smaller parts of equal size. In our case
the input data $D$ to the convex function is partitioned into two workloads of
unequal sizes, $D_i$ and $D_j$. A classical optimization algorithm [19] is then
applied to each of the workloads $D_i$ and $D_j$ to obtain globally optimal
solutions $\theta_i$ and $\theta_j$, based on each of their inputs. The desired
solution $\theta$ for the original workflow $D$ is then obtained as a linear
combination of the solutions from the two workloads,

$$\theta = f\theta_i + (1 - f)\theta_j$$
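
The combination step can be illustrated on a toy problem. The sketch below is not the experiment described here: it replaces the logistic regression objective and the optimization algorithm of [19] with a one-parameter least squares fit that has a closed-form solution, and it uses synthetic data. All names and values are hypothetical.

```python
import random


def fit_slope(points):
    """Closed-form least squares slope through the origin (stand-in optimizer)."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / sxx


def partitioned_fit(points, f):
    """Solve each workload separately and combine as f*theta_i + (1-f)*theta_j."""
    split = int(round(f * len(points)))
    if not 0 < split < len(points):
        raise ValueError("f must leave both workloads non-empty")
    theta_i = fit_slope(points[:split])   # workload D_i, a fraction f of the data
    theta_j = fit_slope(points[split:])   # workload D_j, the remaining fraction
    return f * theta_i + (1.0 - f) * theta_j


def make_data(n, true_slope, noise, seed):
    """Synthetic data y = true_slope * x + Gaussian noise (illustrative only)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(1.0, 2.0)
        data.append((x, true_slope * x + rng.gauss(0.0, noise)))
    return data


data = make_data(1000, 3.0, 0.1, seed=0)
theta = partitioned_fit(data, 0.3)
print(theta, fit_slope(data))
```

On this toy data the weighted combination lands close to the slope fitted on the full data set, which is the property the linear combination relies on.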




Figure 3: (a) $\mu$ with respect to $f$; (b) $\sigma^2$ with respect to $f$.
$\mu$ and $\sigma^2$ as functions of $f$ for parallel optimization of the convex
least squares error function.

We ran the parallel optimization algorithm on two virtual machines,
each with one CPU core running at 2667 MHz. To generate uncertainty in
the completion time of each workload, we ran background processes on each 
machine, which created contention for CPU resources. 

Figures 3a and 3b show how the mean completion times and their variances
vary with $f$. The mean and variance at each value of $f$ were
obtained by repeating many trials of the optimization process over a long period
of time using different values of $f$. Figure 4 shows $\mu$ and $\sigma^2$
parametrically as a function of each other for this parallel optimization case.
As can be seen, one obtains a performance curve similar to the theoretical one in
Figure 2. More importantly, the results clearly show that both the total
completion time and its variability are much lower than those of the original
unpartitioned workflow. This implies that one can always choose a partition
(given by the value of $f$) such that it lowers both the completion time of the
computation and its uncertainty.

Figure 4: Observed values of $\mu(f)$ and $\sigma^2(f)$ as a parametric function
of each other.

Next, we performed a file transmission experiment by transferring a fixed 
size file in parallel from a source node to a destination node over two network 
paths. Besides its intrinsic value, this experiment also acts as a proxy for other 
spatial workflows which are harder to test in the laboratory, such as urban traffic 
or transportation routes. For our file transmission experiment, since the TCP
network protocol does not allow fine-grained control of how the file packets
travel through the Internet, we created an intermediate overlay to redirect a
fraction of the file packets through a different path.

The source node used in our experiment was hosted in New York City, while
the destination node was hosted in Singapore. The use of traceroute showed
that network packets went through the west coast of the US before reaching
the destination in Singapore. This implies that network packets from New York
City are routed across the Pacific Ocean to Singapore.

We wanted to determine if having an alternate route for sending some of the
file packets from New York to Singapore via Europe provided a better
transmission process. We thus created another host in London to act as an overlay
which received file packets from New York and forwarded them to Singapore.

We split a large file into two workloads whose sizes depended on the value
of $f$, and sent each of them across one of the two network channels.
We ensured that only network transmission times contributed to the completion
times by ignoring disk I/O delays.

Figure 5: Histogram of completion times at $f = 0.5$ for one workload.
Due to inherent fluctuations in the network pipelines, the completion time
for a file of fixed size was Normally distributed with a given mean and
variance. This distribution of completion times was consistent for the values
of $f \in \{0.0, 0.1, 0.2, \ldots, 1.0\}$ during our experiments.


To measure the completion time of the two parallel file transfers, the node at 
the destination measured the time of the last packet (from either channel) and 
then subtracted the time of the request for the first packet (from both channels). 

Figure 6: (a) $\mu$ with respect to $f$; (b) $\sigma^2$ with respect to $f$.
$\mu$ and $\sigma^2$ as functions of $f$ for dual transmission of a file.

In order to measure the mean and variance of the transmission times, we 
repeated the file transfer 20024 times over a period of 72 hours from Sunday to 
Tuesday. For each trial, we randomized the value of /. 
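
The estimation procedure (randomize $f$ on each trial, then group trials by $f$ and compute the sample mean and variance) can be sketched with simulated channel times. The code below does not reproduce the network experiment: it draws Normal completion times for each channel, and the parameter values and function name are hypothetical.

```python
import random
import statistics
from collections import defaultdict


def simulate_trials(n_trials, mu_i, sigma_i, mu_j, sigma_j, seed=0):
    """Return {f: (sample mean, sample variance)} of simulated completion times."""
    rng = random.Random(seed)
    grid = [k / 10.0 for k in range(11)]   # f = 0.0, 0.1, ..., 1.0
    samples = defaultdict(list)
    for _ in range(n_trials):
        f = rng.choice(grid)               # randomize f on every trial
        t_i = rng.gauss(f * mu_i, f * sigma_i)
        t_j = rng.gauss((1.0 - f) * mu_j, (1.0 - f) * sigma_j)
        # The transfer completes when the slower channel delivers its part.
        samples[f].append(max(t_i, t_j))
    return {f: (statistics.mean(ts), statistics.variance(ts))
            for f, ts in sorted(samples.items())}


results = simulate_trials(20000, 30.0, 2.0, 20.0, 6.0)
for f, (mean, var) in results.items():
    print(f, round(mean, 2), round(var, 2))
```

Randomizing $f$ per trial, rather than running all trials for one value before moving to the next, spreads slow drifts in background load evenly across the values of $f$.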

Figure 5 shows the distribution of completion times for the value $f = 0.5$,
which is well approximated by a Normal distribution. Figures 6a and 6b show
how the mean completion times and their variances varied as a function of $f$.
Similar to the optimization case, the results for this file transmission
experiment are also consistent with the theoretical predictions shown in
Figures 1 and 2.

These results show that this general methodology for partitioning uncertain
workflows leads to shorter expected completion times with reduced uncertainty.
All that is needed after obtaining such a curve is to decide on the value of $f$
that lowers uncertainty and expected completion time. A very direct application
of this method would be in the information technology domain, as it allows for
new formulations of pricing schemes for Quality-of-Service (QoS) [10, 11]
offerings, since in order to satisfy demand, large cloud and data systems need to
increase the speed with which they process incoming jobs.

There are several obvious extensions of this work. Unlike the scenarios we
studied, where the statistical properties of the system are known, one often
encounters situations where the processing capabilities of the systems have to be
estimated on the fly. Methods based on Bayesian inference during deployment
[22] would then provide the distributions of completion times that are needed to
partition a given workload. Moreover, one can generalize the splitting procedure
to very many components. In that case, methods like group testing could
be utilized to decide on the best choice of the number of components.

Finally, we stress that the applicability of this method extends beyond the
execution of computer algorithms and file transmissions over the Internet.
Alleviating congestion in urban traffic, job scheduling in manufacturing, finding
optimal routes for supply chain scenarios and any other activities that allow for
some parallelism can also exploit this approach.